Name: Mohamed Ayman Mohamed

ID : 900182267

This is the new version of the Spotify case study with EDA. In this notebook, I will use a new dataset that lets me answer a wider variety of questions.

In this notebook, I will try to answer the following questions:

1. Is the number of tracks surging drastically over time?

2. Does the duration of songs decrease as time passes?

3. Do songs have a higher liveness rate over the timeline?

4. Are more sad songs being released as time passes?

5. Is there a direct relationship between danceability, energy, and instrumentalness?

1. Understanding the Data

  1. artists: The list of artists of the song.

  2. danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

  3. duration_ms: The duration of the track in milliseconds.

  4. energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. (Float)

  5. explicit: Whether or not the track has explicit lyrics (true = yes; false = no or unknown).

  6. instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

  7. key: The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

  8. liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

  9. loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

  10. mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

  11. name: Name of the song.

  12. popularity: The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity are derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.

  13. release_date: The date the album was first released, for example “1981-12-15”. Depending on the precision, it might be shown as “1981” or “1981-12”.

  14. speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

  15. tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

  16. valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

  17. year: Year information extracted from release_date.

  18. genres: A list of the genres used to classify the album. For example: “Prog Rock”, “Post-Grunge”. (If not yet classified, the array is empty.)

1.2 Checking whether the data is corrupted or not

1.3 Checking whether the data has missing values

Conclusion to section 1:

It is clear that the dataset does not have any duplicates. However, there are missing values in the name feature. As mentioned before, dealing with proper names (converting them to a meaningful numerical representation) is quite difficult, so for simplicity it will be dropped. Second, we need to convert the categorical data to one-hot encoding. The categorical features are [mode, key, time_signature, explicit]. This needs to happen before data visualization.
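The duplicate and missing-value checks described above can be sketched with pandas. The small dataframe here is a toy stand-in for the real dataset (the column names are the ones listed in section 1):

```python
import pandas as pd

# Toy stand-in for the Spotify dataframe
df = pd.DataFrame({
    "name": ["Song A", None, "Song C"],
    "mode": [1, 0, 1],
    "key": [0, 5, 7],
    "duration_ms": [200000, 180000, 240000],
})

# 1.2: any fully duplicated rows?
n_duplicates = df.duplicated().sum()

# 1.3: missing values per column ('name' has one here)
missing_per_column = df.isna().sum()

# As in the conclusion: drop rows whose 'name' is missing
df = df.dropna(subset=["name"])
```

On the toy data, `n_duplicates` is 0 and only `name` has a missing value, mirroring what the checks report on the real dataset.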

The artist_name feature is quite problematic, as it carries important information: popularity is strongly affected by the artist name, at least theoretically (I will need to prove this later :)). We have two options for converting this string feature to a numerical representation. Text embedding is the best option, as it takes semantics into account, but it requires a huge amount of storage. Another option is one-hot encoding [114030 new features], but we would need a powerful PC to pre-process such a thing. We cannot represent it as unique IDs, because unique IDs would add noise to the data (the feature needs a higher-dimensional representation). For the sake of simplicity, I will drop it.

3. artist_id is not necessary for the data at all.

4. Regarding the release date feature, we just need the year.

5. Converting the duration unit to minutes.

6. Shifting loudness so its minimum is 0 instead of a negative number.
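Steps 4-6 above can be sketched in a few lines of pandas (toy values assumed; `release_date` precision varies between YYYY, YYYY-MM, and YYYY-MM-DD, so taking the first four characters covers all three):

```python
import pandas as pd

df = pd.DataFrame({
    "release_date": ["1981-12-15", "1999", "2010-06"],
    "duration_ms": [180000, 240000, 300000],
    "loudness": [-60.0, -7.5, 0.0],
})

# Step 4: keep only the year from release_date
df["year"] = df["release_date"].str[:4].astype(int)

# Step 5: milliseconds -> minutes
df["duration_min"] = df["duration_ms"] / 60000

# Step 6: shift loudness so its minimum becomes 0 instead of a negative number
df["loudness"] = df["loudness"] - df["loudness"].min()
```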

2. Preparing the Data for Data Visualization

2.1 Deleting the name, id, id_artists, and artists Features

3. Data Visualization

3.1 Visualizing the distribution of key and mode in the data

3.2 Visualize each feature using skewness and kurtosis:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.


Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution (4th moment in the moment-based calculation). That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.
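Both statistics are one call away in pandas. A small sketch on a synthetic right-skewed column (an exponential sample standing in for a feature like duration); note that pandas `kurt()` reports excess kurtosis, so a normal distribution scores approximately 0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed toy column (an exponential distribution has skewness 2
# and excess kurtosis 6)
sample = pd.Series(rng.exponential(scale=3.0, size=10000))

skewness = sample.skew()   # positive -> right tail is longer
kurtosis = sample.kurt()   # excess kurtosis; > 0 means heavier tails than normal
```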

(Figure: standard symmetric probability density functions.)

3.3 Visualize the box plot of each feature in univariate and multivariate analyses to help us detect the outliers

3.3.1 Univariate outliers (one-variable outlier analysis):

From the three sections above, we can clearly see that the data is skewed. In this sense, we need to take outliers into consideration. However, we need to rely more on multivariate outliers.

3.3.2 Multivariate outliers (more-than-one-variable outlier analysis):

Conclusion:

Generally, we cannot rely on the univariate outlier analysis of each feature alone. One solution is to use PCA; another is to inspect the correlations between all the variables to see which one is better. I have tried to analyze the outliers from the skewness, univariate, and multivariate plots and reach a meaningful conclusion. After constructing box plots and plots in higher dimensions, we can notice that there are some outliers in duration_ms.

3.4 Number of Tracks each year

Conclusion:

We can see that the number of tracks increases, but not linearly, and there are drops in specific years. Intuitively, though, the number of tracks has increased dramatically over time.
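The per-year track count behind this bar chart is a one-liner in pandas (toy `year` values assumed):

```python
import pandas as pd

df = pd.DataFrame({"year": [1990, 1990, 1991, 1992, 1992, 1992]})

# Number of tracks released each year, in chronological order
tracks_per_year = df["year"].value_counts().sort_index()

# tracks_per_year.plot(kind="bar")  # the bar chart used in this section
```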

3.5 Liveness vs Time

Conclusion:

We can see from the bar chart above that liveness is slowly decreasing. It means that the liveness rate is declining over time.

3.6 Track Duration vs Time

Conclusion:

We can see from the bar chart above that track duration is slowly decreasing. It means that songs are getting shorter over time.

3.7 Valence vs Time

It is very difficult to claim that sad songs have been released much more than happy songs, but as time passes, we can notice that valence is slowly decreasing.

3.8 BiVariate Data Analysis


3.8.1 Using Heatmaps Representing Correlations

3.8.2 Checking the correlations between continuous features
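A minimal sketch of this step: `DataFrame.corr()` gives the pairwise Pearson correlations that the heatmap displays. The three columns here are synthetic stand-ins (loudness is constructed to correlate with energy; valence is independent):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
energy = rng.random(500)
df = pd.DataFrame({
    "energy": energy,
    "loudness": energy * 0.8 + rng.normal(0, 0.1, 500),  # correlated by construction
    "valence": rng.random(500),                          # independent
})

# Pairwise Pearson correlations between the continuous features
corr = df.corr()

# sns.heatmap(corr, annot=True)  # with seaborn, for the heatmap view
```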

3.8.3 Popularity vs Danceability

Conclusion:

It is clear that when a song is reasonably danceable, its popularity tends to be quite high. Although some outliers at the beginning undermine the claim, overall there is a strong relationship between them.

From the three graphs above, there is a strong correlation between danceability and energy, and a correlation between energy and instrumentalness. However, there is no relationship between danceability and instrumentalness.

We can notice correlations between several features, but popularity is the most connected to the other features (connected meaning strongly correlated).

Checking if two categorical variables are independent can be done with Chi-Squared test of independence.

Chi-Square test of independence is most commonly used to test association between two categorical variables. The output gives us p-value, degrees of freedom and expected values.

The steps for using the Chi-Square test: set up the hypotheses, prepare the contingency table, compute the expected counts, compare the observed values with the expected values, and conclude the hypothesis test.

We know that if the p-value > 0.05, we fail to reject the null hypothesis and treat the two features as independent.
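The steps above can be sketched with `scipy.stats.chi2_contingency`. The two categorical columns here are toy stand-ins, built to be perfectly balanced so the test finds no association:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical columns, balanced so they are independent by construction
df = pd.DataFrame({
    "mode":     [1, 0, 1, 0] * 20,
    "explicit": [0, 0, 1, 1] * 20,
})

# Steps 1-2: contingency table of observed counts
table = pd.crosstab(df["mode"], df["explicit"])

# Steps 3-5: expected counts, test statistic, and p-value
chi2, p_value, dof, expected = chi2_contingency(table)

if p_value > 0.05:
    print("fail to reject H0: the two features look independent")
```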

Popularity vs artist Name

4. Data Wrangling and Feature Engineering

I dropped unnecessary features during visualization.

4.1 Handling the problem of Outliers

Outliers lead to many problems ("sensitivity to noise"). We can address them with different techniques: if we delete all outliers, a large share of the data will be deleted; if we leave them, problems will occur. Also, we saw from the multivariate analysis that most outliers occur in the duration feature.

Through research, I have found that the features bounded to the interval [0, 1] come from the API, so extreme values there are not necessarily outliers. However, duration_ms is heavy-tailed (according to its kurtosis), so we will remove outliers in duration_ms only.
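One common way to do this removal is the 1.5-IQR rule from the box plots used earlier; a minimal sketch on toy durations (the real notebook may use slightly different bounds):

```python
import pandas as pd

# Toy durations; the last value is an extreme outlier
df = pd.DataFrame({"duration_ms": [180000, 200000, 210000, 190000, 3_000_000]})

# Classic 1.5-IQR rule, applied only to duration_ms as decided above
q1, q3 = df["duration_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["duration_ms"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[mask]
```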

4.2 Converting all categorical features to one hot encoding
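A minimal sketch of this step with `pd.get_dummies` (toy columns assumed; continuous features such as energy pass through untouched):

```python
import pandas as pd

df = pd.DataFrame({
    "key": [0, 5, 7, 0],
    "mode": [1, 0, 1, 1],
    "energy": [0.3, 0.9, 0.5, 0.7],
})

# One-hot encode the categorical columns; each distinct value
# becomes its own indicator column (key_0, key_5, ..., mode_0, mode_1)
df = pd.get_dummies(df, columns=["key", "mode"])
```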

4.3 Separate the label (ground truth) from the features and give each label a unique id

4.5 Handling negative Numbers

We can see from the above that, after rescaling, the skew and kurtosis approach the symmetric state.

5. Data Modeling (Clustering)

There are a number of clustering methods, such as affinity propagation, expectation maximization (EM), k-means, and hierarchical clustering. Affinity propagation and expectation maximization need huge amounts of memory to store the data: affinity propagation builds three 2D matrices to compute the exemplars (responsibility, availability, and self-availability). Also, k-means clustering is a special case of EM; EM uses distributions (mean and variance) to form clusters (soft clustering). Generally, I will use k-means for visualizing the data, as it is the simplest in terms of complexity.

5.1 K-means Clustering (Hard Clustering): I will use elbow analysis to get the best number of clusters

We can see below that, according to distortion, it is better to choose between 10 and 15 clusters, while the inertia method suggests that 5 is best. Due to the huge number of data points, we will not be able to distinguish between them clearly. We pick the number of clusters at the point where the within-cluster sum of squared distances stops dropping abruptly.

5.1.1 Using Distortion

5.1.2 Using Inertia

Just to draw the clustering for visualization, I will take a subset of the data.
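The elbow analysis can be sketched with scikit-learn's `KMeans` on a synthetic 2-D subset (the three cluster centers here are assumptions, standing in for the scaled audio features):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy 2-D subset: three well-separated blobs around (0,0), (3,3), (6,6)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

# Elbow analysis: inertia (within-cluster sum of squares) for each k
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases with k; the "elbow" is where the drop flattens
# (here around k=3, the true number of blobs)
```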

6. Training the Model with the Pre-processed Dataset

Now the data is ready for training. Since we are only doing EDA, we do not need to train an ML model. Generally, the problem is a multiclass classification problem [low, medium, high]. We could use neural networks, logistic regression, SVMs, decision trees, or any other classifier to solve it.
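If one did train such a classifier, a minimal sketch could look like the following (synthetic features, popularity binned into the three classes mentioned above, and logistic regression as one of the listed options; the bin edges are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "danceability": rng.random(300),
    "energy": rng.random(300),
    "popularity": rng.integers(0, 101, size=300),
})

# Bin popularity (0-100) into the three classes: low / medium / high
df["label"] = pd.cut(df["popularity"], bins=[-1, 33, 66, 100],
                     labels=["low", "medium", "high"])

X_train, X_test, y_train, y_test = train_test_split(
    df[["danceability", "energy"]], df["label"],
    test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

On these random features the accuracy is naturally near chance; with the real audio features it would reflect how predictive they actually are.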

References:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

https://realpython.com/k-means-clustering-python/

https://www.geeksforgeeks.org/box-plot-in-python-using-matplotlib/

https://www.kaggle.com/adisrw/spotify-data-analysis-using-python/data

https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

https://towardsdatascience.com/statistics-in-python-using-chi-square-for-feature-selection-d44f467ca745

https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365

https://medium.com/@atanudan/kurtosis-skew-function-in-pandas-aa63d72e20de

https://t.co/rurMFoBmlY